Environment

1 Dimension Overview

The first plot shows all the environment indicators from both the current studies and the original framework on the y-axis. Purple indicates that an indicator is used only in the current studies, orange that it appears only in the Wiltshire framework, and green that it is used in both.

The x-axis shows the number of secondary data metrics collected to represent each indicator. Some indicators are backed by many metrics, but for many others I have found little to represent them.

NASS figures are used to cover on-farm water use, energy efficiency, and acres in conservation practices. I used the National Aquatic Resource Surveys, aggregated at the state level, to measure water quality. Land use diversity is well represented by the Multi-Resolution Land Characteristics LULC layers, which I also aggregated at the county level. Greenhouse gas emissions come from EPA figures by state, broken down by economic sector. Finally, the USFS TreeMap dataset accounts for aboveground biomass and could also serve as a reasonable proxy for tree vigor. There is more to pull out here than I have so far.

Otherwise, if anyone has ideas for secondary datasets to cover the rest of the indicators, please do let me know.

Code
pacman::p_load(
  dplyr,
  ggplot2,
  stringr,
  plotly,
  RColorBrewer
)

## Load data for tree and metrics
env_tree <- readRDS('data/trees/env_tree.rds')

meta <- readRDS('data/sm_data.rds')[['metadata']] %>% 
  filter(dimension == 'environment')

# Format to match Wiltshire framework
meta <- meta %>% 
  mutate(
    indicator = str_to_sentence(indicator),
    indicator = case_when(
      str_detect(indicator, '^Above') ~ 'Aboveground biomass',
      str_detect(indicator, '^Water') ~ 'Water use / irrigation efficiency',
      TRUE ~ indicator
    )
  ) 

# Counts of secondary data metrics
counts <- meta %>% 
  group_by(indicator) %>% 
  dplyr::summarize(count = n())

# Join to Wiltshire framework
colors <- RColorBrewer::brewer.pal(n = 3, name = 'Dark2')
dat <- full_join(env_tree, counts, by = join_by(Indicator == indicator)) %>% 
  mutate(
    count = ifelse(is.na(count), 0, count),
    label_color = case_when(
      Use == 'both' ~ colors[1],
      Use == 'wiltshire_only' ~ colors[2],
      Use == 'current_only' ~ colors[3]
    )
  )

# Plot
dat %>%
  ggplot(aes(x = Indicator, y = count)) +
  geom_col(
    color = 'black',
    fill = 'grey'
  ) +
  # Dummy points to force a color legend keyed to indicator use
  geom_point(
    data = dat,
    aes(x = 1, y = 1, color = Use),
    inherit.aes = FALSE
  ) +
  scale_color_manual(
    name = "Indicator Use:",
    values = c(
      "both" = colors[1],
      "current_only" = colors[3],
      "wiltshire_only" = colors[2]
    ),
    labels = c(
      'Both',
      'Current Only',
      'Framework Only'
    )
  ) +
  theme_classic() +
  theme(
    axis.text.y = element_text(color = dat$label_color),
    legend.position = "bottom",
    plot.margin = margin(t = 10, r = 75, b = 10, l = 10)
  ) +
  guides(
    color = guide_legend(override.aes = list(size = 3))
  ) +
  coord_flip() +
  labs(y = 'Secondary Data Count')

Bar Plot of Indicators

2 Maps

Taking a quick tour through some of the spatial data here. I’m not including full rasters of LULC or TreeMap layers to conserve space, but I have included some derived metrics by county. With the exception of hotspot and species atlas data, these will also be up on the Shiny app along with all the other metrics.

Land Use Diversity

This is derived from the USGS MRLC 30m LULC layer for 2023. LULC types are aggregated by category (water, developed, barren, forest, shrubland, herbaceous, cultivated, wetlands) and Shannon diversity is calculated for each county.
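As a minimal sketch of that calculation (made-up cell counts, not the actual MRLC tallies), Shannon diversity for one county comes from the proportion of cells in each aggregated LULC category:

```r
# Hypothetical cell counts per aggregated LULC category for one county
cells <- c(water = 120, developed = 300, forest = 850,
           cultivated = 400, wetlands = 60)

p <- cells / sum(cells)       # proportion of county cells per category
shannon <- -sum(p * log(p))   # Shannon diversity index H'
shannon                       # higher = more even mix of land cover types
```

The index tops out at log(k) when all k categories are equally represented, so counties dominated by a single cover type score near zero.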

Code
pacman::p_load(
  mapview,
  dplyr,
  sf,
  leaflet,
  leafpop,
  viridisLite
)

div <- readRDS('data/sm_data.rds')[['lulc_div']]

mapview(
  div,
  zcol = 'lulc_div',
  label = 'county_name',
  layer.name = 'LULC Diversity',
  popup = popupTable(
    div,
    zcol = c(
      'county_name',
      'lulc_div'
    ),
    row.numbers = FALSE,
    feature.id = FALSE
  )
)

Land Use Land Cover Diversity

Biodiversity Hotspots

Again, this biodiversity hotspot map was put together around the time of the Y2K crisis. Even if it were more recent and covered all of New England, incorporating this kind of data into the framework seems a bit fraught.

Code
hotspots <- readRDS('data/sm_data.rds')[['hotspots']]
mapview(hotspots, col.regions = '#154734')

Biodiversity Hotspots Map

Forest Biomass

The TreeMap 2016 dataset is a quite comprehensive national survey of forest health and diversity. Updates are infrequent, but it is the best layer I’ve found to address biomass. The raster is at 30m resolution. Shown below is the mean live above-ground biomass aggregated by county so that it plays well with other metrics. Note that it is measured in tons per acre of forest; non-forest cells were removed from the analysis. So it is not showing density of forest, just biomass in existing forest, which is why the more urban counties still show a reasonable density of live biomass. There is lots more that can be pulled out of this dataset, like dead/down carbon, tree stocking, live canopy cover, height, volume, trees per acre, etc. More info can be found here.
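The forest-only averaging can be sketched with a toy table of cell values (hypothetical numbers; the real workflow extracts these from the 30m TreeMap raster):

```r
# Toy raster cells: county id, forest/non-forest flag, live biomass (tons/acre)
cells <- data.frame(
  county  = c("A", "A", "A", "B", "B", "B"),
  forest  = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  biomass = c(80, 120, NA, 60, NA, NA)
)

# Drop non-forest cells first so they don't dilute the county mean
forest_cells <- cells[cells$forest, ]
county_means <- aggregate(biomass ~ county, data = forest_cells, FUN = mean)
county_means  # county A: 100, county B: 60
```

Averaging over all cells instead would conflate forest extent with forest condition, which is exactly what the non-forest mask avoids.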

Code
pacman::p_load(
  mapview,
  dplyr,
  sf,
  viridisLite,
  leaflet,
  leafpop
)
biomass <- readRDS('data/sm_data.rds')[['mean_biomass']]
mapview(
  biomass,
  zcol = 'mean_biomass',
  layer.name = 'Mean Live Above<br>Ground Biomass<br>(tons per acre)',
  label = 'county_name',
  popup = popupTable(
    biomass,
    zcol = c(
      'county_name',
      'mean_biomass'
    ),
    feature.id = FALSE,
    row.numbers = FALSE
  )
)

Map of aboveground forest biomass by county

3 Metadata Table

This table includes all secondary data metrics. Filter by dimension to get just environment metrics.

Using the table:

  • Click column headers to sort
  • Global search in the top right, or column search in each header
  • Change page length and page through results at the bottom
  • Use the download button to download a .csv file of the filtered table
  • Click the arrow on the left of each row for details, including a URL to the data source.
Code
pacman::p_load(
  dplyr,
  reactable,
  stringr,
  htmltools
)

# Load full metadata table
metadata_all <- readRDS('data/sm_data.rds')[['metadata']]

# Pick out variables to display
metadata <- metadata_all %>% 
  select(
    metric,
    'Variable Name' = variable_name,
    definition,
    dimension,
    index,
    indicator,
    units,
    'Year' = latest_year, # Renaming latest year as year, not including og year
    source,
    scope,
    resolution,
    url
) %>% 
  setNames(c(str_to_title(names(.))))

###
htmltools::browsable(
  tagList(
    
    tags$div(
      style = "display: flex; gap: 16px; margin-bottom: 20px; justify-content: center;",
      
      tags$button(
        class = "btn btn-primary",
        style = "display: flex; align-items: center; gap: 8px; padding: 8px 12px;",
        tagList(fontawesome::fa("eye"), "Show/hide more columns"),
        onclick = "Reactable.setHiddenColumns('metadata_table', prevColumns => {
          return prevColumns.length === 0 ? ['Definition', 'Scope', 'Resolution', 'Url'] : []
        })"
      ),
      
      tags$button(
        class = "btn btn-primary",
        style = "display: flex; align-items: center; gap: 8px; padding: 8px 12px;",
        tagList(fontawesome::fa("download"), "Download as CSV"),
        onclick = "Reactable.downloadDataCSV('metadata_table', 'sustainability_metadata.csv')"
      )
    ),
    
    reactable(
      metadata,
      sortable = TRUE,
      resizable = TRUE,
      filterable = TRUE,
      searchable = TRUE,
      pagination = TRUE,
      bordered = TRUE,
      wrap = TRUE,
      rownames = FALSE,
      onClick = 'select',
      striped = TRUE,
      pageSizeOptions = c(5, 10, 25, 50, 100),
      defaultPageSize = 5,
      showPageSizeOptions = TRUE,
      highlight = TRUE,
      style = list(fontSize = "14px"),
      compact = TRUE,
      columns = list(
        Metric = colDef(
          minWidth = 200,
          sticky = 'left'
        ),
        'Variable Name' = colDef(
          minWidth = 150
        ),
        Definition = colDef(
          minWidth = 250
        ),
        'Year' = colDef(minWidth = 75),
        Source = colDef(minWidth = 250),
        Scope = colDef(show = FALSE),
        Resolution = colDef(show = FALSE),
        Url = colDef(
          minWidth = 300,
          show = FALSE
        )
      ),
      defaultColDef = colDef(minWidth = 100),
      elementId = "metadata_table",
      details = function(index) {
        div(
          style = "padding: 15px; border: 1px solid #ddd; margin: 10px 0;
             background-color: #E0EEEE; border-radius: 10px; border-color: black;
             box-shadow: 2px 2px 10px rgba(0, 0, 0, 0.1);",
          
          tags$h4(
            strong("Details"), 
          ),
          tags$p(
            strong('Metric Name: '), 
            as.character(metadata_all[index, 'metric']),
          ),
          tags$p(
            strong('Variable Name: '), 
            as.character(metadata_all[index, 'variable_name']),
          ),
          tags$p(
            strong('Definition: '), 
            as.character(metadata_all[index, 'definition']),
          ),
          tags$p(
            strong('Source: '), 
            as.character(metadata_all[index, 'source'])
          ),
          tags$p(
            strong('Latest Year: '), 
            as.character(metadata_all[index, 'latest_year'])
          ),
          tags$p(
            strong('All Years (cleaned, wrangled, and included): '), 
            as.character(metadata_all[index, 'year'])
          ),
          tags$p(
            strong('Updates: '), 
            str_to_title(as.character(metadata_all[index, 'updates']))
          ),
          tags$p(
            strong('URL: '), 
            tags$a(
              href = as.character(metadata_all[index, 'url']),
              target = '_blank',
              as.character(metadata_all[index, 'url'])
            )
          )
        )
      }
    )
  )
)

4 Data Table

Code
pacman::p_load(
  dplyr,
  reactable,
  stringr,
  htmltools
)

# Load metrics and metadata
metadata_all <- readRDS('data/sm_data.rds')[['metadata']]
metrics <- readRDS('data/sm_data.rds')[['metrics']]
fips_key <- readRDS('data/sm_data.rds')[['fips_key']]

# Value formatting function based on units
source('dev/format_values.R')

# Filter to environment metrics, join with metadata and county fips codes
env_metrics <- metrics %>% 
  left_join(metadata_all, by = join_by('variable_name')) %>% 
  filter(dimension == 'environment') %>% 
  left_join(fips_key, by = join_by('fips')) %>% 
  mutate(county_name = ifelse(is.na(county_name), state_name, county_name)) %>% 
  format_values() %>% 
  select(
    metric,
    'Variable Name' = variable_name,
    definition,
    year = year.x,
    Area = county_name,
    units,
    value
  ) %>% 
  setNames(c(str_to_title(names(.)))) %>% 
  filter(!is.na(Value))


## Reactable table
htmltools::browsable(
  tagList(
    
    tags$div(
      style = "display: flex; gap: 16px; margin-bottom: 20px; justify-content: center;",
      tags$button(
        class = "btn btn-primary",
        style = "display: flex; align-items: center; gap: 8px; padding: 8px 12px;",
        tagList(fontawesome::fa("download"), "Download as CSV"),
        onclick = "Reactable.downloadDataCSV('metrics_table', 'sustainability_metrics.csv')"
      )
    ),
    
    reactable(
      env_metrics,
      sortable = TRUE,
      resizable = TRUE,
      filterable = TRUE,
      searchable = TRUE,
      pagination = TRUE,
      bordered = TRUE,
      wrap = TRUE,
      rownames = FALSE,
      onClick = 'select',
      striped = TRUE,
      pageSizeOptions = c(5, 10, 25, 50, 100),
      defaultPageSize = 5,
      showPageSizeOptions = TRUE,
      highlight = TRUE,
      style = list(fontSize = "14px"),
      compact = TRUE,
      columns = list(
        Metric = colDef(
          minWidth = 125,
          sticky = 'left'
        ),
        'Variable Name' = colDef(
          minWidth = 125
        ),
        Definition = colDef(
          minWidth = 250
        ),
        Units = colDef(minWidth = 100),
        'Year' = colDef(minWidth = 100)
      ),
      defaultColDef = colDef(minWidth = 100),
      elementId = "metrics_table"
    )
  )
)

5 Distribution Plots

By County

Note that while most of the available secondary data is at the county level, the environment dimension includes a fair amount at the state level as well. This includes greenhouse gas emissions and water quality surveys. For now, I’ll just show these separately, but some creative aggregation will have to happen eventually.

Code
pacman::p_load(
  dplyr,
  purrr,
  ggplot2,
  rlang,
  ggpubr,
  stringr,
  tidyr
)
source('dev/data_pipeline_functions.R')
source('dev/filter_fips.R')
metrics <- readRDS('data/sm_data.rds')[['metrics']]
metadata <- readRDS('data/sm_data.rds')[['metadata']]

# Use metadata to help filter by dimension
env_meta <- metadata %>% 
  filter(dimension == 'environment')

# Filter to environment dimension
env_metrics <- metrics %>% 
  filter(variable_name %in% env_meta$variable_name)

# env_metrics$variable_name %>% unique
# get_str(env_metrics)

# Filter to latest year and new (post-2024) counties
# And pivot wider so it is easier to get correlations
env_county <- env_metrics %>%
  filter_fips(scope = 'counties') %>% 
  get_latest_year() %>% 
  select(fips, variable_name, value) %>% 
  mutate(variable_name = str_split_i(variable_name, '_', 1)) %>% 
  pivot_wider(
    names_from = 'variable_name',
    values_from = 'value'
  ) %>% 
  unnest(!fips) %>% 
  mutate(across(c(2:last_col()), as.numeric))
# get_str(env_county)

## Plot
plots <- map(names(env_county)[-1], \(var){
  if (is.character(env_county[[var]])) {
    env_county %>% 
      ggplot(aes(x = !!sym(var))) + 
      geom_bar(
        fill = 'lightblue',
        color = 'royalblue',
        alpha = 0.5
      ) +
      theme_classic() +
      theme(plot.margin = unit(c(rep(0.5, 4)), 'cm'))
  } else if (is.numeric(env_county[[var]])) {
    env_county %>% 
      ggplot(aes(x = !!sym(var))) + 
      geom_density(
        fill = 'lightblue',
        color = 'royalblue',
        alpha = 0.5
      ) +
      theme_classic() +
      theme(plot.margin = unit(c(rep(0.5, 4)), 'cm'))
  } else {
    return(NULL)
  }
}) 


# Arrange them in 4 columns
ggarrange(
  plotlist = plots,
  ncol = 4,
  nrow = 7
)

Distributions of environmental metrics at the county level.

By State

Code
pacman::p_load(
  dplyr,
  purrr,
  ggplot2,
  rlang,
  ggpubr,
  stringr,
  tidyr
)

state_codes <- readRDS('data/sm_data.rds')[['fips_key']] %>% 
  select(fips, state_code)

env_state <- env_metrics %>%
  filter_fips(scope = 'state') %>% 
  get_latest_year() %>% 
  select(fips, variable_name, value) %>% 
  mutate(variable_name = str_split_i(variable_name, '_', 1)) %>% 
  pivot_wider(
    names_from = 'variable_name',
    values_from = 'value'
  ) %>% 
  unnest(!fips) %>% 
  mutate(across(c(2:last_col()), as.numeric)) %>% 
  left_join(state_codes, by = 'fips')
# get_str(env_state)

# Variables to map. Take out some that didn't come through well.
vars <- names(env_state)[-1] %>% 
  str_subset(
    'lakesAcidCond|lakesCylsperEpaCond|lakesMicxEpaCond|state_code|waterIrrSrcOffFarmExp|waterIrrReclaimedAcreFt|waterIrrReclaimedOpenAcres',
    negate = TRUE
  )

## Plot
plots <- map(vars, \(var){
  env_state %>% 
    ggplot(aes(y = !!sym(var), x = state_code, color = state_code)) + 
    geom_point(
      alpha = 0.5,
      size = 3
    ) +
    theme_classic() +
    theme(
      plot.margin = unit(c(rep(0.5, 4)), 'cm'),
      legend.position = 'none'
    ) +
    labs(
      x = 'State'
    )
}) 

# Arrange them in 4 columns
ggarrange(
  plotlist = plots,
  ncol = 4,
  nrow = 17
)

Distributions of environmental variables at state level

6 Bivariate Plots

Using a selection of variables at the county level. The variable names are a bit hard to fit in here, but from left to right across the top they are LULC diversity, mean live above-ground forest biomass, conservation income per farm, conservation easement acres per farm, conservation tillage: no-till acres per farm, conservation tillage: excluding no-till acres per farm, and cover cropping: excluding CRP acres per farm.

Code
pacman::p_load(
  GGally
)

# Neat function for mapping colors to ggpairs plots
# https://stackoverflow.com/questions/45873483/ggpairs-plot-with-heatmap-of-correlation-values
map_colors <- function(data,
                       mapping,
                       method = "p",
                       use = "pairwise",
                       ...) {
  # grab data
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # calculate correlation
  corr <- cor(x, y, method = method, use = use)
  colFn <- colorRampPalette(c("blue", "white", "red"), interpolate = 'spline')
  fill <- colFn(100)[findInterval(corr, seq(-1, 1, length = 100))]
  
  # correlation plot
  ggally_cor(data = data, mapping = mapping, color = 'black', ...) +
    theme_void() +
    theme(panel.background = element_rect(fill = fill))
}

lower_function <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.5) +
    geom_smooth(color = "blue", fill = "grey", ...) +
    theme_bw()
}

# Rename variables to be shorter
env_county %>%
  select(
    LULC = lulcDiversity,
    Biomass = meanAboveGrndForBiomass,
    consIncomePF,
    consEasementAcresPF,
    consTillNoTillAcresPF,
    consTillExclNoTillAcresPF,
    coverCropExclCrpAcresPF
  ) %>%
  ggpairs(
    upper = list(continuous = map_colors),
    lower = list(continuous = lower_function),
    axisLabels = 'show'
  ) + 
  theme(
    strip.text = element_text(size =  5),
    axis.text = element_text(size =   5),
    legend.text = element_text(size = 5)
  )

7 Correlations

Only showing correlations by county because we don’t have enough observations to run it by state.

Code
pacman::p_load(
  dplyr,
  tidyr,
  tibble,
  stringr,
  purrr,
  ggplot2,
  plotly,
  reshape,
  Hmisc,
  viridisLite
)

# get_str(env_county)

cor <- env_county %>% 
  select(-fips) %>% 
  as.matrix() %>% 
  rcorr()

# Melt correlation values and rename columns
cor_r <- melt(cor$r) %>% 
  setNames(c('var_1', 'var_2', 'value'))

# Save p values
cor_p <- melt(cor$P)
p.value <- cor_p$value

# Make heatmap with custom text aesthetic for tooltip
plot <- cor_r %>% 
  ggplot(aes(var_1, var_2, fill = value, text = paste0(
    'Var 1: ', var_1, '\n',
    'Var 2: ', var_2, '\n',
    'Correlation: ', format(round(value, 3), nsmall = 3), '\n',
    'P-Value: ', format(round(p.value, 3), nsmall = 3)
  ))) + 
  geom_tile() + 
  scale_fill_viridis_c() + 
  theme(axis.text.x = element_text(hjust = 1, angle = 45)) +
  labs(
    x = NULL,
    y = NULL,
    fill = 'Correlation'
  )

# Convert to interactive plotly figure with text tooltip
ggplotly(
  plot, 
  tooltip = 'text',
  width = 1000,
  height = 800
)

Interactive correlation plot of metrics by county

8 PCA

Again, this is only at the county level. First, imputing missing data.

Code
pacman::p_load(
  missForest
)

# Wrangle dataset. Need all numeric vars or factor vars. And can't be tibble
# Also removing character vars - can't use these in PCA
dat <- env_county %>%
  select(where(is.numeric)) %>%
  as.data.frame()
# get_str(dat)

# Impute missing variables
set.seed(42)
mf_out <- dat %>%
  missForest(
    ntree = 200,
    mtry = 10,
    verbose = FALSE,
    variablewise = FALSE
  )

# Save imputed dataset
imp <- mf_out$ximp

# Print OOB
mf_out$OOBerror
   NRMSE 
0.697027 
Code
pacman::p_load(
  psych
)
VSS(imp)


Very Simple Structure
Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
    n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
VSS complexity 1 achieves a maximimum of 0.81  with  1  factors
VSS complexity 2 achieves a maximimum of 0.94  with  2  factors

The Velicer MAP achieves a minimum of 0.06  with  8  factors 
BIC achieves a minimum of  -296  with  8  factors
Sample Size adjusted BIC achieves a minimum of  277.68  with  8  factors

Statistics by number of factors 
  vss1 vss2   map dof chisq
1 0.81 0.00 0.145 350  2179
2 0.76 0.94 0.076 323  1504
3 0.74 0.94 0.075 297  1299
4 0.58 0.89 0.070 272  1106
5 0.55 0.85 0.069 248   919
6 0.52 0.82 0.072 225   704
7 0.51 0.83 0.056 203   641
8 0.50 0.80 0.056 182   492
                                                                                                                                                                                                                                                                      prob
1 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000039
2 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002228854813387505566513180355059375870041549205780029296875000000000000000000000000000000000000000000000000000000
3 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000408777760604219059237252054117561783641576766967773437500000000000000000000000000000000000000000000000000000000000000000000000000000000000
4 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000602490680256329630285275222867369393497938290238380432128906250000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
5 0.0000000000000000000000000000000000000000000000000000000000000000000000000000091858681800976788285953422708018933917628601193428039550781250000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
6 0.0000000000000000000000000000000000000000000000000089846726271129007624846596335999038274167105555534362792968750000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
7 0.0000000000000000000000000000000000000000000000727638909934370133351586029668567334738327190279960632324218750000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
8 0.0000000000000000000000000000022116284781404799940351946219152523553930222988128662109375000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
  sqresid  fit RMSEA  BIC SABIC complex eChisq  SRMR eCRMS  eBIC
1   39.37 0.81  0.26  663  1766     1.0   2206 0.196 0.204   690
2   11.72 0.94  0.22  105  1123     1.2    481 0.091 0.099  -918
3    7.36 0.96  0.21   13   949     1.4    273 0.069 0.078 -1013
4    5.07 0.98  0.20  -72   785     1.8    170 0.054 0.064 -1008
5    3.50 0.98  0.19 -155   627     1.9    100 0.042 0.051  -974
6    2.51 0.99  0.17 -270   439     1.9     57 0.031 0.041  -918
7    1.41 0.99  0.17 -238   402     2.0     26 0.021 0.029  -854
8    0.84 1.00  0.15 -296   278     1.9     13 0.015 0.021  -776
Code
fa.parallel(imp)

Parallel analysis suggests that the number of factors =  4  and the number of components =  3 

VSS suggests 1 or 2 factors, MAP suggests 8, and parallel analysis suggests 4 factors or 3 components. Let’s go with 3 for now.

Code
(pca_out <- pca(imp, nfactors = 3))
Principal Components Analysis
Call: principal(r = r, nfactors = nfactors, residuals = residuals, 
    rotate = rotate, n.obs = n.obs, covar = covar, scores = scores, 
    missing = missing, impute = impute, oblique.scores = oblique.scores, 
    method = method, use = use, cor = cor, correct = 0.5, weight = NULL)
Standardized loadings (pattern matrix) based upon correlation matrix
                            RC1   RC2   RC3   h2    u2 com
lulcDiversity              0.00  0.57 -0.08 0.33 0.672 1.0
meanAboveGrndForBiomass   -0.39  0.46  0.21 0.41 0.593 2.4
consIncomeNOps             0.93  0.09  0.04 0.88 0.123 1.0
consIncomeTotal            0.63  0.35  0.20 0.56 0.444 1.8
consIncomePF               0.24  0.53 -0.10 0.35 0.648 1.5
alleyCropSilvapastureNOps  0.18  0.62  0.55 0.72 0.282 2.2
consEasementAcres          0.08  0.34  0.87 0.88 0.121 1.3
consEasementAcresPF        0.29 -0.11  0.80 0.73 0.271 1.3
consEasementNOps          -0.17  0.74  0.45 0.77 0.230 1.8
consTillExclNoTillAcres    0.94  0.15  0.07 0.92 0.080 1.1
consTillExclNoTillAcresPF  0.92 -0.03  0.09 0.85 0.152 1.0
consTillExclNoTillNOps     0.31  0.84  0.17 0.84 0.163 1.4
consTillNoTillAcres        0.73  0.33  0.25 0.70 0.300 1.7
consTillNoTillAcresPF      0.67 -0.01  0.20 0.48 0.517 1.2
consTillNoTillNOps         0.12  0.89  0.16 0.84 0.161 1.1
coverCropExclCrpAcres      0.89  0.21 -0.06 0.83 0.168 1.1
coverCropExclCrpAcresPF    0.89  0.02  0.07 0.80 0.204 1.0
coverCropExclCrpNOps       0.24  0.86  0.07 0.81 0.190 1.2
drainedDitchesAcres        0.93  0.21  0.00 0.91 0.088 1.1
drainedDitchesAcresPF      0.94  0.09  0.02 0.89 0.107 1.0
drainedDitchesNOps         0.47  0.53  0.11 0.52 0.482 2.1
drainedTileAcres           0.71  0.06  0.30 0.60 0.403 1.4
drainedTileAcresPF         0.71 -0.05  0.31 0.61 0.394 1.4
drainedTileNOps            0.70  0.40  0.28 0.74 0.265 1.9
precisionAgNOps            0.66  0.55 -0.18 0.78 0.222 2.1
rotateIntenseGrazeNOps     0.21  0.71  0.47 0.76 0.241 1.9
fertExpenseTotal           0.83  0.32 -0.19 0.82 0.176 1.4
fertExpenseOpsWithExp      0.07  0.93 -0.04 0.86 0.136 1.0

                        RC1  RC2  RC3
SS loadings           10.72 6.70 2.74
Proportion Var         0.38 0.24 0.10
Cumulative Var         0.38 0.62 0.72
Proportion Explained   0.53 0.33 0.14
Cumulative Proportion  0.53 0.86 1.00

Mean item complexity =  1.4
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is  0.07 
 with the empirical chi square  301.08  with prob <  0.42 

Fit based upon off diagonal values = 0.98
Code
plot(pca_out$values)
abline(h = 1)

The scree plot makes a pretty good case for 3 components here as well, as it has a nice elbow after the third.

It looks like the first component is made up of most of the conservation agriculture practices from the NASS datasets, namely acres of conservation tillage, cover cropping, and drainage. Fertilizer expenses load surprisingly strongly here too. The second component seems to have the most to do with county size or population; anything measured by the number of operations does well here, as does mean above-ground forest biomass. The last component is a grab-bag: conservation easement acres load the strongest onto it, but I don’t see a coherent pattern among its metrics.

9 Analysis

Imputation

Code
pacman::p_load(
  missForest
)
# Wrangle dataset. Need all numeric vars or factor vars. And can't be tibble
# Also removing character vars - can't use these in PCA
# Using old Connecticut counties - some lulc data is missing for them though
dat <- env_county %>%
  filter_fips('old') %>% 
  select(fips, where(is.numeric)) %>%
  column_to_rownames('fips') %>% 
  as.data.frame()
# get_str(dat)
# skimr::skim(dat)

# Remove variables with most missing data - too much to impute.
# Also remove the proportional LULC values - keeping diversity though
dat <- dat %>% 
  select(-matches('consIncome'), -matches('^lulcProp'))

# Impute missing variables
set.seed(42)
mf_out <- dat %>%
  missForest(
    ntree = 200,
    mtry = 10,
    verbose = FALSE,
    variablewise = FALSE
  )

# Save imputed dataset
imp <- mf_out$ximp

# Print OOB
mf_out$OOBerror
     NRMSE 
0.00142119 

Normalization

Using min-max normalization to put all indicators on an arbitrary 0-1 scale.

Code
# get_str(imp)

# Function to normalize data
normalize <- function(x) {
    (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Normalize all variables
dat <- map_dfc(imp, normalize)

Now that we have normalized variables, we have to make normative decisions about what constitutes a good or bad value of those variables. This will eventually be a collaborative process where we will seek input from teams to come to some kind of consensus. But until then, I’m going to make some heroic assumptions that LULC diversity is good, above ground forest biomass is good, conservation practices and easements are good, and fertilizer expenses are bad. Open to thoughts here as always.

With that, we can recode our normalized variables accordingly.

Code
# Flip 'bad' variables by subtracting them from 1
normed <- dat %>% 
  mutate(across(c(matches('^fert')), ~ 1 - .x))
# get_str(normed)

Component Extraction

Determine the number of components to extract using a few tools: very simple structure (VSS), Velicer’s minimum average partial (MAP) test, parallel analysis, and a scree plot.

Code
pacman::p_load(
  psych
)
VSS(normed)


Very Simple Structure
Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
    n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
VSS complexity 1 achieves a maximimum of 0.79  with  1  factors
VSS complexity 2 achieves a maximimum of 0.94  with  2  factors

The Velicer MAP achieves a minimum of 0.06  with  8  factors 
BIC achieves a minimum of  -212.51  with  8  factors
Sample Size adjusted BIC achieves a minimum of  190.51  with  8  factors

Statistics by number of factors 
  vss1 vss2   map dof chisq
1 0.79 0.00 0.157 275  1633
2 0.76 0.94 0.081 251  1083
3 0.73 0.94 0.077 228   897
4 0.60 0.88 0.081 206   722
5 0.55 0.85 0.063 185   597
6 0.53 0.85 0.063 165   547
7 0.52 0.82 0.062 146   429
8 0.52 0.79 0.060 128   326
                                                                                                                                                                                                prob
1 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000026
2 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000098945650053446549182165070668304451828589662909507751464843750000000000000000000000000000
3 0.000000000000000000000000000000000000000000000000000000000000000000000000000000039129475698153761190588306728699308223440311849117279052734375000000000000000000000000000000000000000000000000000
4 0.000000000000000000000000000000000000000000000000000000000199731474755145863271901807145525253872619941830635070800781250000000000000000000000000000000000000000000000000000000000000000000000000
5 0.000000000000000000000000000000000000000000007276720351332410293559993519352246948983520269393920898437500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
6 0.000000000000000000000000000000000000000001728048955544093425503587857505749525444116443395614624023437500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
7 0.000000000000000000000000000014378040855225031248269818018314936125534586608409881591796875000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
8 0.000000000000000000340581713295255120773491475105743120366241782903671264648437500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
  sqresid  fit RMSEA  BIC SABIC complex eChisq  SRMR eCRMS eBIC
1   34.26 0.79  0.27  476  1342     1.0 1687.8 0.205 0.214  531
2    9.70 0.94  0.22   28   818     1.2  348.0 0.093 0.102 -707
3    5.48 0.97  0.21  -61   657     1.4  170.5 0.065 0.075 -788
4    3.90 0.98  0.19 -145   504     1.7  105.4 0.051 0.062 -761
5    2.57 0.98  0.18 -181   402     1.8   55.9 0.037 0.047 -722
6    1.42 0.99  0.19 -147   373     1.8   24.9 0.025 0.034 -669
7    0.89 0.99  0.17 -185   274     1.8   12.9 0.018 0.026 -601
8    0.59 1.00  0.15 -213   191     1.8    6.6 0.013 0.020 -532
Code
fa.parallel(normed)

Parallel analysis suggests that the number of factors =  3  and the number of components =  3 

VSS suggests 1 or 2 factors, MAP suggests 8, and parallel analysis suggests 3. I’m going with 3 here, for reasons explained further below.
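The three criteria can also be pulled together in one place. A minimal sketch using psych, run on a synthetic two-factor stand-in (in the document this would be run on `normed`):

```r
library(psych)

## Synthetic stand-in for `normed`: two latent factors, three metrics each
set.seed(1)
f   <- matrix(rnorm(300 * 2), nrow = 300)
toy <- scale(f[, c(1, 1, 1, 2, 2, 2)] + matrix(rnorm(300 * 6, sd = 0.5), 300, 6))

vss_out <- vss(toy, n = 3, plot = FALSE)  # runs VSS and MAP together
which.min(vss_out$map)                    # MAP: number of factors at the minimum
fa.parallel(toy, plot = FALSE)            # parallel analysis, printed summary
```

The `vss()` call reports the VSS complexity-1 and complexity-2 suggestions alongside the MAP values, so the disagreement between criteria is visible in a single output.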

Principal Components Analysis

Using a varimax rotation, which is orthogonal, so the rotated components remain uncorrelated (an oblique rotation such as oblimin would allow correlations between components). First we just observe the scree plot to see how the components play out.

Code
pca_out <- pca(normed, nfactors = 3, rotate = 'varimax')
plot(pca_out$values)
abline(h = 1)

This scree plot shows the eigenvalues (variance explained) of each principal component (y-axis) against each component (x-axis). The first few components explain lots of variance, but there is a decent elbow around the fourth component. Just looking at this plot, I would be tempted to take four, but the results below suggest three is more appropriate for our application.
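The eigenvalues psych plots here (`pca_out$values`) are just the eigenvalues of the correlation matrix, so the scree plot can be reproduced directly. A sketch on a synthetic stand-in for `normed`:

```r
## Synthetic stand-in for the normed metric matrix
set.seed(1)
toy <- scale(matrix(rnorm(200 * 8), 200, 8))

ev <- eigen(cor(toy), symmetric = TRUE)$values
sum(ev > 1)  # Kaiser criterion: number of components with eigenvalue > 1
plot(ev)     # same shape as plot(pca_out$values)
abline(h = 1)
```

Since the metrics are standardized, the eigenvalues sum to the number of metrics, which is why eigenvalue > 1 means "explains more than one metric's worth of variance."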

Now let’s look at the PCA results.

Code
pca_out
Principal Components Analysis
Call: principal(r = r, nfactors = nfactors, residuals = residuals, 
    rotate = rotate, n.obs = n.obs, covar = covar, scores = scores, 
    missing = missing, impute = impute, oblique.scores = oblique.scores, 
    method = method, use = use, cor = cor, correct = 0.5, weight = NULL)
Standardized loadings (pattern matrix) based upon correlation matrix
                            RC1   RC2   RC3   RC4   h2    u2 com
lulcDiversity             -0.06  0.55 -0.20  0.38 0.48 0.516 2.1
meanAboveGrndForBiomass   -0.44  0.51 -0.06  0.31 0.55 0.449 2.7
alleyCropSilvapastureNOps  0.20  0.68  0.54 -0.14 0.81 0.186 2.2
consEasementAcres          0.01  0.41  0.81  0.12 0.84 0.159 1.5
consEasementAcresPF        0.20 -0.06  0.77  0.21 0.69 0.314 1.3
consEasementNOps          -0.16  0.80  0.32  0.02 0.77 0.233 1.4
consTillExclNoTillAcres    0.91  0.14  0.11  0.25 0.92 0.081 1.2
consTillExclNoTillAcresPF  0.89 -0.03  0.14  0.20 0.85 0.146 1.2
consTillExclNoTillNOps     0.31  0.83  0.14  0.14 0.82 0.180 1.4
consTillNoTillAcres        0.70  0.31  0.38  0.08 0.74 0.264 2.0
consTillNoTillAcresPF      0.67 -0.02  0.38 -0.10 0.61 0.392 1.6
consTillNoTillNOps         0.14  0.89  0.10  0.12 0.83 0.168 1.1
coverCropExclCrpAcres      0.93  0.21  0.00 -0.02 0.92 0.083 1.1
coverCropExclCrpAcresPF    0.90  0.02  0.17  0.04 0.84 0.157 1.1
coverCropExclCrpNOps       0.31  0.88  0.02 -0.03 0.87 0.130 1.3
drainedDitchesAcres        0.88  0.17  0.05  0.29 0.89 0.108 1.3
drainedDitchesAcresPF      0.92  0.07  0.08  0.17 0.89 0.113 1.1
drainedDitchesNOps         0.38  0.49  0.10  0.46 0.61 0.394 3.0
drainedTileAcres           0.54  0.03  0.31  0.68 0.85 0.149 2.3
drainedTileAcresPF         0.57 -0.10  0.36  0.55 0.77 0.231 2.8
drainedTileNOps            0.59  0.37  0.27  0.56 0.87 0.130 3.1
precisionAgNOps            0.64  0.52 -0.16  0.28 0.79 0.214 2.5
rotateIntenseGrazeNOps     0.21  0.74  0.48 -0.09 0.84 0.163 1.9
fertExpenseTotal           0.87  0.32 -0.15  0.06 0.89 0.112 1.3
fertExpenseOpsWithExp      0.13  0.93 -0.10  0.02 0.88 0.117 1.1

                       RC1  RC2  RC3  RC4
SS loadings           8.73 6.43 2.69 1.96
Proportion Var        0.35 0.26 0.11 0.08
Cumulative Var        0.35 0.61 0.71 0.79
Proportion Explained  0.44 0.32 0.14 0.10
Cumulative Proportion 0.44 0.77 0.90 1.00

Mean item complexity =  1.7
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.06 
 with the empirical chi square  136.54  with prob <  1 

Fit based upon off diagonal values = 0.99

Recommendations for creating composite indices are to extract components that each have eigenvalues > 1, explained variance > 0.10, and such that the proportion of explained variance for the total set is > 0.60 (Nicoletti 2000; OECD 2008).
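These three rules can be made explicit with a small helper. This is a sketch, not part of the original analysis: the function name `check_extraction` is made up, the cutoffs are the ones just cited, and the synthetic fit stands in for `pca_out`:

```r
library(psych)

## Synthetic stand-in fit; in the document this would run on `pca_out`
set.seed(1)
toy <- scale(matrix(rnorm(200 * 8), 200, 8))
fit <- principal(toy, nfactors = 3, rotate = 'varimax')

## Apply the Nicoletti / OECD extraction rules to a psych::principal fit
check_extraction <- function(fit, min_prop = 0.10, min_cum = 0.60) {
  k  <- ncol(fit$loadings)   # number of extracted components
  va <- fit$Vaccounted       # variance-accounted table printed above
  list(
    eigenvalues_gt_1 = fit$values[seq_len(k)] > 1,
    prop_var_ok      = va['Proportion Var', ] > min_prop,
    cumulative_ok    = unname(va['Cumulative Var', k]) > min_cum
  )
}
check_extraction(fit)
```

On the real data, any `FALSE` in the output flags the threshold a candidate solution fails.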

Our total cumulative explained variance is 0.74, and the component that explains the least variance is RC3 at 0.11. Note that extracting four or more components here gives us a component with less than 0.10, which is why we are sticking with three. The first component (RC1) explains 38% of the variance in the data, the second is respectable at 0.26, and the third is barely above the threshold at 0.11.

Looking at the metrics, we can see that the first component loads mostly onto the conservation practices: no-till acres, cover cropping, drainage, and total fertilizer expenses. The second component loads onto mean aboveground biomass (although there is cross-loading with the first component), operations with silvopasture, operations with easements, rotational grazing operations, and operations with fertilizer expenses. This seems to be catching more of the population-related metrics. The last component only loads onto a few metrics: easement acres, easement acres per farm, and silvopasture operations (which has some heavy cross-loading).
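Groupings like these can be pulled out programmatically by assigning each metric to the component it loads on most strongly. A sketch on a synthetic psych::principal fit standing in for `pca_out` (cross-loadings are ignored here, so it is a rougher cut than the reading above):

```r
library(psych)

## Synthetic stand-in for `pca_out`
set.seed(1)
toy <- scale(matrix(rnorm(200 * 8), 200, 8))
colnames(toy) <- paste0('metric', 1:8)
fit <- principal(toy, nfactors = 3, rotate = 'varimax')

L <- unclass(fit$loadings)                    # metrics x components matrix
dominant <- colnames(L)[apply(abs(L), 1, which.max)]
split(rownames(L), dominant)                  # metrics grouped by strongest loading
```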


References

Nicoletti, Giuseppe. 2000. “Summary Indicators of Product Market Regulation with an Extension to Employment Protection Legislation.” OECD Economics Department Working Papers 226. https://doi.org/10.1787/215182844604.
OECD. 2008. Handbook on Constructing Composite Indicators: Methodology and User Guide. Paris: Organisation for Economic Co-operation and Development.